Building Synthetic Voices 

Limited domain synthesis 
Computer Speech Processing Group Project Part A
Due : In next class 



Telling the time




Festival includes a very simple little script that speaks the current time
(festival/examples/saytime). This section explains how to replace the synthesizer used
from this script with one that talks with your own voice. This is an extreme example of a limited
domain synthesizer but it is a good example as it allows us to give a walkthrough of the stages
involved in building a limited domain synthesizer. This example is also small enough that it can
be done in well under an hour.

Following through this example will give a reasonable understanding of the relative importance
of many important steps in the voice building process.

The following tasks are required:

• Designing the prompts
• Customizing the synthesizer front end
• Recording the prompts
• Autolabeling the prompts
• Building utterance structures for recorded utterances
• Extracting pitchmarks and building LPC coefficients
• Building a clunit based synthesizer from the utterances
• Testing and tuning

Before starting, set the environment variables FESTVOXDIR and ESTDIR to the directories which
contain the festvox distribution and the Edinburgh Speech Tools respectively. Under bash and
other good shells this may be done by commands like

export FESTVOXDIR=/home/awb/projects/festvox
export ESTDIR=/home/awb/projects/1.4.3/speech_tools
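
Every later command relies on these two variables, so a quick check before starting may save some confusion. The following is a minimal sketch; check_dir_var is a hypothetical helper written for this handout and is not part of festvox or the Speech Tools.

```shell
#!/bin/sh
# check_dir_var NAME
# Succeed only if the variable NAME is set and names an existing directory.
# (Illustrative helper, not a festvox script.)
check_dir_var() {
    name=$1
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set" >&2
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name=$value is not a directory" >&2
        return 1
    fi
    echo "$name=$value"
}

# Report on both variables without aborting the shell.
check_dir_var FESTVOXDIR || true
check_dir_var ESTDIR || true
```

The check only confirms that the directories exist; it does not verify that they contain a working build.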

In earlier releases we only offered a command line based method for building voices and limited
domain synthesizers. In order to make the process easier and less prone to error we have
introduced a graphical front end to these scripts. This front end is called pointyclicky (as it
offers a pointy-clicky interface). It is particularly useful in the actual prompting and recording.
Although pointyclicky is the recommended route, in this section we go through the process step
by step to give a better understanding of what is required and where problems may lie that
require attention.



A simple script is provided for setting up the basic directory structure and copying in some default
parameter files. The festvox distribution includes all the setup for the time domain. When
building for your own domain, you will need to provide the file etc/DOMAIN.data containing your
prompts (as described below).

mkdir ~/data/time
cd ~/data/time

$FESTVOXDIR/src/ldom/setup_ldom cmu time awb

As in the definition of diphone databases we require three identifiers for the voice. These are
(loosely) institution, domain and speaker. Use net if you feel there isn't an appropriate institution
for you, though we have also used the name of the project that the voice is being built for here. The
domain name seems well defined. For the speaker name we have also used style as opposed to
speaker name. The primary reason for these identifiers is so that people do not all build limited domain
synthesizers with the same name, which would make it impossible to load them into the same instance of
festival.
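
The three identifiers end up in the name of the voice, as in the cmu_time_awb_ldom voice loaded in the testing step of this section. As a small illustration of how distinct identifiers keep names from clashing, here is a sketch that joins them in that pattern. The helper ldom_voice_name and the second example's identifiers are hypothetical; the setup script does its own naming.

```shell
#!/bin/sh
# ldom_voice_name INST DOMAIN SPEAKER
# Join the three identifiers following the pattern seen in this section,
# e.g. cmu + time + awb gives cmu_time_awb_ldom.
# (Illustrative helper only, not part of festvox.)
ldom_voice_name() {
    if [ "$#" -ne 3 ]; then
        echo "usage: ldom_voice_name INST DOMAIN SPEAKER" >&2
        return 1
    fi
    echo "${1}_${2}_${3}_ldom"
}

ldom_voice_name cmu time awb
# Hypothetical second voice: no institution, a style instead of a speaker.
ldom_voice_name net time formal
```

Two voices built with different identifiers get different names, so both can be loaded in one Festival session.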

This setup script makes the directories and copies basic scheme files into the festvox/
directory. You may need to edit these files later.

Designing the prompts 

In this saytime example the basic format of the utterance is 

The time is now, EXACTNESS MINUTE INFO, in the DAYPART.

For example 

The time is now, a little after five to ten, in the morning.

In all there are 1152 (4x12x12x2) utterances (although there are three possible day info parts
(morning, afternoon and evening) they only get 12 hours, 6 hours and 6 hours respectively).
Although it would technically be possible to record all of these we wish to reduce the amount of
recording to a minimum. Thus what we actually do is ensure there is at least one example of each
value in each slot.
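
The count above can be checked with shell arithmetic. This is only a sketch: the text gives the product 4x12x12x2 without naming the factors, so the assignment of each factor to a slot in the comments below is an assumption.

```shell
#!/bin/sh
# Slot sizes, read off the 4x12x12x2 product given in the text.
# Which factor belongs to which slot is an assumption made for illustration.
exactness=4       # exactly, just after, a little after, almost
minute_info=12    # twelve five-minute positions around the clock face
hours=12          # one to twelve
day_halves=2      # each clock hour occurs twice a day

# Every combination of slot values.
total=$((exactness * minute_info * hours * day_halves))

# The three day parts share the 24 hours of the day as 12 + 6 + 6.
day_slots=$((12 + 6 + 6))

echo "all combinations: $total"
echo "hours covered by the day parts: $day_slots"
```

The second figure shows why three day parts still contribute only a factor of two: together they cover the 24 hours of the day, which is 12 clock hours twice over.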

Here is a list of 24 utterances that should cover the main variations.

The time is now, exactly five past one, in the morning

The time is now, just after ten past two, in the morning 

The time is now, a little after quarter past three, in the morning 

The time is now, almost twenty past four, in the morning 

The time is now, exactly twenty-five past five, in the morning 

The time is now, just after half past six, in the morning 

The time is now, a little after twenty-five to seven, in the morning 

The time is now, almost twenty to eight, in the morning 
[...]

The time is now, [...] after ten to two, in the afternoon




These examples are first put in the prompt file with an utterance number and the prompt in
double quotes like this.



(time0001 "The time is now ...")
(time0002 "The time is now ...") 
(time0003 "The time is now ...") 



These prompts should be put into etc/DOMAIN.data. This file is used by many of the following
sub-processes.
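
Numbering the entries by hand is error prone, so the prompt file can be generated from a plain list of sentences. The following is a minimal sketch that follows the entry layout shown above; format_prompts is a hypothetical helper, not a festvox script, and it assumes the prompts themselves contain no double quotes.

```shell
#!/bin/sh
# format_prompts PREFIX
# Read one prompt per line on standard input and write entries of the form
# (PREFIXnnnn "text"), numbering from 0001. Blank lines are skipped.
# (Illustrative helper, not part of festvox.)
format_prompts() {
    awk -v prefix="$1" '
        NF > 0 {
            n++
            printf "(%s%04d \"%s\")\n", prefix, n, $0
        }'
}

# Two prompts from the list above, numbered with the "time" prefix.
printf '%s\n' \
    "The time is now, exactly five past one, in the morning" \
    "The time is now, just after ten past two, in the morning" |
    format_prompts time
```

Redirecting the output to etc/time.data would give a file in the form the later steps read.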



Recording the prompts 



Record a few examples on the PC to see how much noise is being picked up by the mike. For
example try the following

$ESTDIR/bin/na_record -f 16000 -time 5 -o test.wav -otype riff



This will record 5 seconds from the microphone in the machine you run the command on. You
should also do this to test that the microphone is plugged in (and switched on). Play back the
recorded wave with na_play and perhaps play with the mixer levels until you get the least
background noise with the strongest spoken signal. Now you should display the waveform to see
(as well as hear) how much noise is there.

$FESTVOXDIR/src/general/display_sg test.wav 



This will display the waveform and its spectrogram. Noise will show up in the silence (and 
other) parts. 



There are a few ways to reduce noise. Ensure the microphone cable isn't wrapped around other
cables (especially power cables). Turning the computer 90 degrees may help and repositioning 
things can help too. Moving the sound board to some other slot in the machine can also help as 
well as getting a different microphone (even the same make). 

There is a large advantage in recording straight to disk as it allows the recording to go directly
into the right files. Doing off-line recording (onto DAT) is better in reducing noise but transferring it
to disk and segmenting it is a long and tedious process.

Once you have checked your recording environment you can proceed with the build process.
First generate the prompts with the command

festival -b festvox/build_ldom.scm '(build_prompts "etc/time.data")'

and prompt and record them with the command

bin/prompt_them etc/time.data

You may or may not find listening to the prompt before speaking useful. Simply displaying
them may be adequate for you (if so comment out the na_play line in bin/prompt_them).

Autolabeling the prompts

The recorded prompts can be labeled by aligning them against the synthesized prompts. This is
done by the command

bin/make_labs prompt-wav/*.wav

If the utterances are long (> 10 seconds of speech) you may require lots of swap space to do this
stage (this could be fixed).

Once labeled you should check that they are labeled reasonably. The labeler typically gets it
pretty much correct, or very wrong, so a quick check can often save time later. You can check
the database using the command

emulabel etc/emu_lab

Once you are happy with the labeling you can construct the whole utterance structure for the
spoken utterances. This is done by combining the basic structure from the synthesized prompts
and the actual times from the automatically labeled ones. This can be done with the command

festival -b festvox/build_ldom.scm '(build_utts "etc/time.data")' 



Extracting pitchmarks and building LPC coefficients 



Getting good pitchmarks is important to the quality of the synthesis; see the Section called
Extracting pitchmarks from waveforms in the Chapter called Basic Requirements for a more
detailed discussion on extracting pitchmarks from waveforms. For limited domain
synthesizers pitch extraction is a little less crucial than for diphone collection, though spending a
little time on this does help.

If you have recorded EGG signals then you can use bin/make_pm to extract pitchmarks from the .lar files. Note that
you may need to add (or remove) the option -inv depending on the updownness of your EGG
signal. However so far only the CSTR laryngograph seems to produce inverted signals so the
default should be adequate. Also note the parameters that specify the pitch period range, -min
and -max. The default settings are suitable for a male speaker; for a female speaker you should modify these
to something like

-min 0.0033 -max 0.0083 -def 0.005


This changes from a (male) range of 200Hz-80Hz with a default of 100Hz to a female range of
300Hz-120Hz with a default of 200Hz.
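
The options take pitch periods in seconds while the ranges are quoted in Hz, and the two are related by period = 1 / frequency. The following sketch does the conversion; hz_to_period is a hypothetical helper written for this handout, not a Speech Tools program.

```shell
#!/bin/sh
# hz_to_period HZ
# Print the pitch period in seconds, to four decimal places, for a
# fundamental frequency given in Hz (period = 1 / frequency).
# (Illustrative helper, not part of the Speech Tools.)
hz_to_period() {
    awk -v hz="$1" 'BEGIN {
        if (hz <= 0) exit 1
        printf "%.4f\n", 1 / hz
    }'
}

# The highest pitch gives the shortest period (-min) and the
# lowest pitch gives the longest period (-max).
echo "-min $(hz_to_period 300) -max $(hz_to_period 120) -def $(hz_to_period 200)"
```

Note the inversion: the upper end of the frequency range sets -min and the lower end sets -max.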

If you don't have an EGG signal you must extract the pitch from the waveform itself. This works,
though it may require a little modification of parameters, and it is computationally more expensive
(and won't be as exact as from an EGG signal). There are two methods. The first uses Entropic's
epoch program, which works pretty well without tuning parameters. The second is to use the free
Speech Tools program pitchmark. To use epoch use the program

bin/make_pm_epoch wav/*.wav

To use pitchmark use the command

bin/make_pm_wave wav/*.wav

As with the EGG extraction, pitchmark uses parameters to specify the range of the pitch periods;
you should modify the parameters to best match your speaker's range. The other filter parameters
can also make a difference to the success. Rather than try to explain what changing the figures
means (I admit I don't fully know), the best solution is to explain what you need to obtain as a
result.

Irrespective of how you extract the pitchmarks we have found that a post-processing stage that
moves the pitchmarks to the nearest peak is worthwhile. You can achieve this by

bin/make_pm_fix pm/*.pm 

At this point you may find that your waveform file is upside down. Normally this wouldn't
matter but due to the basic signal processing techniques we use to find the pitch periods, upside
down signals confuse things. People tell me that it shouldn't happen but some recording devices
return an inverted signal. From the cases we've seen the same device always returns the same
form, so if one of your recordings is upside down all of them probably are (though there are some
published speech databases, e.g. BU Radio data, where a random half are upside down).



In general the higher peaks should be positive rather than negative. If not you can invert the 
signals with the command 

for i in wav/*.wav
do
    ch_wave -scale -1.0 "$i" -o "$i"
done

If they are upside down, invert them and re-run the pitch marking. (If you do invert them it is not
necessary to re-run the segment labeling.)

Power normalization can help too. This can be done globally by the command

bin/simple_powernormalize wav/*.wav

This should be sufficient for full sentence examples. In the diphone collection we take greater
care in power normalization but that vowel based technique will be too confused by the longer
more varied examples.

Once you have pitchmarks, next you need to generate the pitch synchronous MELCEP
parameterization of the speech used in building the cluster synthesizer.

bin/make_mcep wav/*.wav

Building a clunit based synthesizer from the utterances

Building a full clunit synthesizer is probably a little bit of overkill but the technique basically
works. See the Chapter called Unit selection databases for a more detailed discussion of unit
selection techniques. The basic parameter file, festvox/time_build.scm, is reasonable as a start.

festival -b festvox/build_ldom.scm '(build_clunits "etc/time.data")' 

If all goes well this should create a file festival/clunits/cmu_time_awb.catalogue and a set
of index trees in festival/trees/cmu_time_awb_time.tree.
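
Once the recordings exist, the remaining stages can be chained so that a failure stops the build early. The following is a sketch only: run_stage and build_after_recording are hypothetical helpers, the order simply follows this section, and it assumes pitchmarks are taken from the waveform with bin/make_pm_wave (no EGG signal). The manual check of the labels with emulabel is not included.

```shell
#!/bin/sh
# run_stage LABEL COMMAND [ARGS...]
# Announce a stage, then run it. With DRY_RUN=1 the command is only
# printed, which helps to check the order before a real build.
# (Illustrative helper, not part of festvox.)
run_stage() {
    label=$1
    shift
    echo "== $label"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "   $*"
        return 0
    fi
    "$@" || { echo "stage failed: $label" >&2; return 1; }
}

# Stages after recording, in the order this section presents them.
# The && chain stops at the first stage that fails.
build_after_recording() {
    run_stage "label prompts"    bin/make_labs prompt-wav/*.wav &&
    run_stage "build utterances" festival -b festvox/build_ldom.scm '(build_utts "etc/time.data")' &&
    run_stage "pitchmarks"       bin/make_pm_wave wav/*.wav &&
    run_stage "fix pitchmarks"   bin/make_pm_fix pm/*.pm &&
    run_stage "power normalize"  bin/simple_powernormalize wav/*.wav &&
    run_stage "mel cepstrum"     bin/make_mcep wav/*.wav &&
    run_stage "build clunits"    festival -b festvox/build_ldom.scm '(build_clunits "etc/time.data")'
}

# Print the plan without running anything (subshell keeps DRY_RUN local).
( DRY_RUN=1; build_after_recording )
```

Calling build_after_recording without DRY_RUN from the voice directory would run the stages for real.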


Testing and tuning 

To test the new voice start Festival as 

festival festvox/cmu_time_awb_ldom.scm '(voice_cmu_time_awb_ldom)' 



The function (saytime) can now be called and it should say the current time, or (saythistime
"11:23").



Note this synthesizer can only say the phrases that it has phones for, which basically means it can
only say the time in the format given at the start of this chapter. Thus although you can use
SayText it will only synthesize words that are in the domain. That's what limited domain
synthesis is.

A full directory structure of this example with the recordings and parameter files is available at
http://festvox.org/examples/cmu_time_awb_ldom/ . An on-line demo of this voice in that
directory is available at http://festvox.org/examples/cmu_time_awb_ldom/ .

Some useful pointers for guidance: 




http://www.festvox.org/bsv/book1.html